POLI 572B

Michael Weaver

March 15, 2024

Regression Interpretation (2)

Objectives

Recap

  • LS is best linear approximation of CEF
  • Intercept and slope interpretations
  • LS can fit group means (dummy variables)

Interpretation

  • Coefficients w/ Controls

Recap

Least Squares and CEF

The least squares prediction \(\widehat{Y} = X'\beta\) gives the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).

  • It is the “best” linear approximation in that the overall set of predictions \(\hat{y}\) has the least distance to \(y\)
  • Least Squares could equivalently fit a line for the mean of \(Y\), \(E(Y | X)\), rather than a prediction for each \(Y_i\). We could use \(E(Y_i | X_i)\) as the dependent variable (see the sketch below)
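This equivalence is easy to check numerically. A minimal sketch with simulated data (all names and numbers invented): least squares on the individual observations gives the same coefficients as least squares on the group means of \(Y\), weighted by group size.

# Sketch: LS on individual Y_i vs. weighted LS on the group means E(Y | X = x)
set.seed(1)
d = data.frame(x = rep(0:4, times = c(10, 20, 30, 20, 20)))
d$y = 2 + 3 * d$x + rnorm(nrow(d))

# collapse to conditional means, keeping group sizes as weights
means = aggregate(y ~ x, data = d, FUN = mean)
means$n = as.vector(table(d$x))

coef(lm(y ~ x, data = d))                  # individual-level fit
coef(lm(y ~ x, data = means, weights = n)) # identical coefficients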

Least Squares and CEF

Alternatively, the least squares slope is a particular weighted average of the derivative of a non-linear CEF (board)

  • weights are greater closer to the median of \(x\)

Linearity in Least Squares

Least Squares is linear in parameters

What does this mean?

  • \(\mathbf{X\beta}\) means we multiply values of \(x\) by coefficients, then add them

\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
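As a concrete illustration of that arithmetic, here is a minimal sketch (toy numbers, all values invented):

# Sketch: predictions are the design matrix times the coefficient vector
X = cbind(1,               # column of 1s for the intercept
          x1 = c(2, 5, 7), # first predictor
          x2 = c(0, 1, 1)) # second predictor
b = c(b0 = 10, b1 = 3, b2 = -2)

X %*% b  # y_hat: 10 + 3*x1 - 2*x2 for each row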

Least Squares and Group Means

Least Squares can predict group means when using binary variables:

  1. Despite different coefficients, least squares can make identical predictions \(\hat{y}\)

  2. We need to specify design matrix/model to get estimates of interest. (e.g. difference in means)

  3. If we have exclusive indicator variables for belonging to distinct categories (sometimes called “dummy variables”), we must drop one group if we fit an intercept. (Why?)

  4. If we fit group means: residuals are deviation from group means (centered at 0 within each group).
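The last point (and the group-mean logic behind it) can be verified directly. A sketch with simulated data (names invented): the fitted values are the group means, and residuals average to zero within each group.

# Sketch: a dummy-variable regression reproduces group means
set.seed(2)
d = data.frame(g = rep(c(0, 1), each = 50))
d$y = ifelse(d$g == 1, 5, 3) + rnorm(100)

m = lm(y ~ g, data = d)
coef(m)                     # intercept = mean in g = 0; slope = difference in means
tapply(d$y, d$g, mean)      # compare: the raw group means
tapply(resid(m), d$g, mean) # ~0 within each group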

Least Squares with Controls

Interpreting w/ Controls

So far, we have considered fitting group means.

  • indicator/dummy variables have been for exclusive membership in groups
  • what happens if we control for membership in different groups?

Interpreting w/ Controls

We want to estimate differences in earnings across gender, but we want to control for the sector of employment.

In this data, we only have doctors and lawyers, so we can estimate the following:

\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Medicine_i\)

Where \(Female_i\) is \(1\) if person is female, \(0\) if male. \(Medicine_i\) is \(1\) if they are a doctor \(0\) if they are a lawyer.

Interpreting w/ Controls

  1. Linear algebra showed us that \(b_1\) captures the relationship between \(Earnings\) and the part of \(Female_i\) that is orthogonal to \(Medicine_i\) and the intercept
  • What is the residual \(Female_i^*\) in this case? (Think about residuals in models with group dummies above)
  • The residual \(Female_i^*\) will be the deviation from the mean level of \(Female\) in law and in medicine

Interpreting w/ Controls

  1. The slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): the variance of the residuals might differ across \(Medicine_i\).
  • What would happen to \(Var(Female^*)\) among doctors if ALL doctors were female?
  • No variation left: no observations differ from the mean of \(1\), so they do not contribute to the slope (zero weight)
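We can confirm the slope formula numerically. A sketch using simulated stand-ins for the ACS variables (the data-generating numbers are invented):

# Sketch: the multiple-regression slope on FEMALE equals
# Cov(FEMALE*, INCEARN) / Var(FEMALE*), where FEMALE* is the residual
# from regressing FEMALE on MEDICINE (and an intercept)
set.seed(3)
n = 1000
MEDICINE = rbinom(n, 1, 0.5)
FEMALE   = rbinom(n, 1, 0.3 + 0.3 * MEDICINE) # gender mix differs by sector
INCEARN  = 100 - 20 * FEMALE + 30 * MEDICINE + rnorm(n, sd = 10)

f_star = resid(lm(FEMALE ~ MEDICINE))           # part of FEMALE orthogonal to MEDICINE
cov(f_star, INCEARN) / var(f_star)              # the slope formula
coef(lm(INCEARN ~ FEMALE + MEDICINE))["FEMALE"] # matches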

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and profession:

ls = lm(INCEARN ~ FEMALE + MEDICINE, acs_data)

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + MEDICINE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226109  -93592  -35873   81891  762891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   176243       2034   86.66   <2e-16 ***
## FEMALE        -55370       2824  -19.61   <2e-16 ***
## MEDICINE       55866       2670   20.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132700 on 9997 degrees of freedom
## Multiple R-squared:  0.07833,    Adjusted R-squared:  0.07814 
## F-statistic: 424.8 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and hours worked:

We’re not controlling for a binary indicator, but a continuous variable.

ls = lm(INCEARN ~ FEMALE + UHRSWORK, acs_data)

Interpreting w/ Controls

  1. Linear algebra showed us that \(b_1\) captures the relationship between \(Earnings\) and the part of \(Female_i\) that is orthogonal to \(Hours_i\) and the intercept
  • What is the residual/orthogonal \(Female_i^*\) in this case?
  • The residual \(Female_i^*\) will have exactly \(0\) covariance/correlation with \(Hours_i\)
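That zero covariance is mechanical, and easy to check. A sketch with simulated stand-ins for the ACS variables:

# Sketch: the residualized FEMALE* is exactly uncorrelated with UHRSWORK,
# even though FEMALE itself is not
set.seed(4)
n = 1000
UHRSWORK = rpois(n, 40)
FEMALE   = rbinom(n, 1, plogis(-(UHRSWORK - 40) / 10)) # related to hours

f_star = resid(lm(FEMALE ~ UHRSWORK))
cor(FEMALE, UHRSWORK) # nonzero in this simulation
cor(f_star, UHRSWORK) # ~0 by construction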

Example

## 
## Call:
## lm(formula = INCEARN ~ FEMALE + UHRSWORK, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -243715  -98889  -38988   90987  796111 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136821.6     6268.7   21.83   <2e-16 ***
## FEMALE      -53694.3     2886.5  -18.60   <2e-16 ***
## UHRSWORK      1241.4      115.3   10.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134800 on 9997 degrees of freedom
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.04879 
## F-statistic: 257.4 on 2 and 9997 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

Gender and hours worked: orthogonal but NOT independent

Interpreting w/ Controls

  1. Slope is \(\frac{Cov(x^*, y)}{Var(x^*)}\): Variance of residuals might differ across \(Hours_i\).
  • What happens to \(Var(Female^*)\) for values of hours worked where \(Hours_i\) badly predicts \(Female_i\)?
  • Greater variance! Increased weight on those observations


Interpreting w/ Controls

What could we do if we really wanted to “hold hours worked constant”?

  • Compare earnings by gender within groups where hours are the same (think about our indicator variables from earlier)

  • We can ensure that gender is exactly unrelated to hours worked by fitting an intercept for each observed value of hours (see the sketch below)

  • In this case, we have 10000 observations, so this is doable. We are not always so lucky
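A sketch of that specification (using this example’s variable names; as.factor() creates one indicator per distinct value of hours worked):

# Sketch: an intercept for every observed value of hours worked, so the
# FEMALE comparison is made within exact hours strata
m_strata = lm(INCEARN ~ FEMALE + as.factor(UHRSWORK), data = acs_data)
coef(m_strata)["FEMALE"]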

Non-Linearity with Least Squares

Non-linearity with Least Squares

If the CEF, \(E[Y|X]\), is not linear in \(X\), we can still use least squares to model this relationship.

The easiest choice is to use a polynomial expansion of \(X\).

If a straight-line relationship between \(X\) and \(Y\) is clearly wrong, we can model a “U”-shape by adding a squared term of \(X\):

\(Earnings_i = b_0 + b_1 Female_i + b_2 Hours_i + b_3 Hours_i^2\)

It is “linear” in that we still multiply values of \(X\) by \(\beta\) and sum, but we use non-linear transformations of the data.

Can we do better?

m = lm(INCEARN ~ FEMALE + poly(UHRSWORK, 3, raw = T) , acs_data)
summary(m)$coefficients
##                                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)                 -5.546584e+05 8.532942e+04  -6.500201 8.404065e-11
## FEMALE                      -4.987018e+04 2.870964e+03 -17.370536 1.305089e-66
## poly(UHRSWORK, 3, raw = T)1  3.302792e+04 4.423739e+03   7.466063 8.952876e-14
## poly(UHRSWORK, 3, raw = T)2 -4.500185e+02 7.373786e+01  -6.102950 1.079941e-09
## poly(UHRSWORK, 3, raw = T)3  1.934612e+00 3.946976e-01   4.901505 9.660031e-07

Non-linearity with Least Squares

We can incorporate all kinds of non-linearity:

  • polynomials (up to \(n-1\) degrees)
  • logarithms (natural log very common)
  • exponents
  • square-roots

But there are trade-offs…

Non-linearity with Least Squares

  1. Choice of non-linearity may be arbitrary:
  • unless we have good theoretical motivation, which non-linearity do we choose?
  • how do we decide which non-linear function is “best”?
  • worst case scenario: we can fit to arbitrary precision.

[Figures: “Linear fit”; “Perfect Polynomial fit…”; “Perfect Polynomial fit?”]

Non-linearity with Least Squares

We risk overfitting the data, leading to very bad extrapolations/interpolations.

  1. Choice of non-linearity:
  • In general, polynomials are not a great idea
  • natural splines fit cubic polynomials in pieces (better; see the sketch below)
  • machine learning tools (GAM, KRLS) fit arbitrary functions
  • But we have to cross-validate to assess out-of-sample predictions
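For instance, the natural-spline version is a one-line change (a sketch; the splines package ships with base R, and df = 4 is an arbitrary choice one would want to cross-validate):

library(splines)

# Sketch: natural cubic spline in hours worked instead of a raw polynomial
m_spline = lm(INCEARN ~ FEMALE + ns(UHRSWORK, df = 4), data = acs_data)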

Non-linearity with Least Squares

  1. Non-linearity makes interpreting coefficients difficult.
  • If we include \(x\) and \(x^2\), we can’t directly interpret either coefficient on its own
  • Non-linearity means the slope is not constant.
  • Coefficients do not summarize the average slope
  • We need to calculate the average partial derivative/average partial effect (see the sketch below)

See Aronow and Miller, Chapter 4.3.4

Basically, if you non-linearly transform \(Y\) or \(X\), you need to check how to interpret.
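For the cubic-in-hours model m fit above, one way to get the average partial derivative of earnings with respect to hours is a numerical derivative averaged over the sample. A sketch, assuming m and acs_data from above:

# Sketch: average partial effect of UHRSWORK in the polynomial model m,
# via a central numerical derivative averaged over the observed data
h = 0.01
d_hi = as.data.frame(acs_data); d_hi$UHRSWORK = d_hi$UHRSWORK + h
d_lo = as.data.frame(acs_data); d_lo$UHRSWORK = d_lo$UHRSWORK - h

mean((predict(m, newdata = d_hi) - predict(m, newdata = d_lo)) / (2 * h))
# the average slope of earnings in hours, averaged over all observations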

Key Takeaways:

  • If we ‘control’ for binary variables, we can exactly “hold things constant”.
    • this is only possible when \(n \gg g\): many more observations than ‘groups’ we want to hold constant
  • If we ‘control’ for continuous variables, least squares only ensures orthogonality between variable of interest and control.
    • orthogonality \(\neq\) independence
    • this ensures only zero linear correlation.
  • We can split the difference w/ non-linear transformations
    • how to choose? how to avoid overfitting?
    • difficult to interpret coefficients for transformed variables.

Conditioning

Recall:

When there is possible confounding, we want to “block” these causal paths using conditioning

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(1\). Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e. within values of \(X\), \(D\) must be as-if random)

  • all ‘backdoor’ paths are blocked
  • no conditioning on colliders to ‘unblock’ backdoor path

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(2\). Positivity/Common Support: For all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(Pr(D = d | X = x) > 0\) and \(Pr(D = d | X = x) < 1\)

  • There must be variation in the level of treatment within every stratum of \(X\)

In order for conditioning to estimate the \(ACE\) without bias, we must assume

\(3\). No Measurement Error: If conditioning variables \(X\) are mis-measured, bias will persist.
  • We’ll revisit this in the context of regression

Two Ways of Conditioning:

  • imputation/matching: “plug in” unobserved potential outcomes of \(\color{red}{Y(1)}\) (\(\color{red}{Y(0)}\)) using observed potential outcomes of \(Y(1)\) (\(Y(0)\)) from cases with same/similar values of \(\mathbf{X_i}\).
  • reweighting: reweight members of “treated” and “untreated” groups based on \(Pr(D_i | \mathbf{X_i})\)

Conditioning by Imputation

We find effect of \(D\) on \(Y\) within each subset of the data uniquely defined by values of \(X_i\).

\(\widehat{ACE}[X = x] = E[Y(1) | D=1, X = x] - E[Y(0) | D=0, X = x]\)

for each value of \(x\) in the data.

Under the Conditional Independence Assumption, \(E[Y(1) | D=1, X=x] = \color{red}{E[Y(1) | D=0, X=x]}\) and \(E[Y(0) | D=0, X=x] = \color{red}{E[Y(0) | D=1, X=x]}\)

We impute the missing potential outcomes… with the expected value of the outcome among observed cases with the same values of \(x\) and different values of \(D\).

Conditioning by Imputation

What have we seen that looks like these values, e.g., \(E[Y(0) | D=0, X=x]\)?

  • this is a conditional expectation function; something we can estimate using regression
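As a concrete sketch of the exact-stratification version of conditioning by imputation (simulated data with one binary confounder \(X\); all numbers invented), we take the difference in means within each stratum of \(X\) and average, weighting by stratum size:

# Sketch: conditioning by stratification with one binary confounder
set.seed(5)
n = 2000
X = rbinom(n, 1, 0.5)
D = rbinom(n, 1, 0.2 + 0.5 * X)  # treatment more likely when X = 1 (confounding)
Y = 1 + 2 * D + 3 * X + rnorm(n) # true effect of D is 2

d = data.frame(Y, D, X)
strata = split(d, d$X)
ace_x = sapply(strata, function(s) mean(s$Y[s$D == 1]) - mean(s$Y[s$D == 0]))
w = sapply(strata, nrow) / n
sum(ace_x * w) # ~2: recovers the ACE despite confounding by X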

A Causal Model for Least Squares

To understand how we can use regression to estimate causal effects, we need to describe the kinds of causal estimands we might be interested in.

Response Schedule

In the context of experiments, each observation has potential outcomes corresponding to their behavior under different treatments

  • To estimate the \(ACE\) of treatment without bias, we assumed that treatment status is independent of potential outcomes

Response Schedule

In regression, where levels of treatment might be continuous, we generalize this idea to the “response schedule”:

  • each unit has a set of potential outcomes across every possible value of the “treatment”
    • \(Y(0), Y(1), Y(2), \dots, Y(D_{max})\)
  • we can summarize the causal effect of \(D\) in two ways:

average causal response function:

  • what is \(E[Y(d)]\) for all \(d \in D\), or what are mean potential outcomes of \(Y\) at values of \(D\).
  • this is a conditional expectation function, defined in terms of potential outcomes

average partial derivative:

  • the average slope of the (possibly non-linear) average causal response function
  • what is the average causal change in \(Y\) induced by \(D\), averaged over all values \(d\) in \(D\)
  • if there are just two levels, the difference between them is the \(ACE\)

Response Schedule

We can use regression to estimate the linear approximation of the average causal response function:

\[Y_i(D_i = d) = \beta_0 + \beta_1 D_i + \epsilon_i\]

Here \(Y_i(D_i = d)\) is the potential outcome of case \(i\) for a value of \(D = d\).

  • this response schedule says that \(E[Y(D = d)]\), on average, changes by \(\beta_1\) for a unit change in \(D\) (or a weighted average of non-linear effects of \(D\)… the average partial derivative)
  • \(\epsilon_i\) is unit \(i\)’s deviation from the average: \(Y_i(D=d) - E[Y(D = d)]\)

Response Schedule

If we don’t know parameters \(\beta_0, \beta_1\), what do we need to assume to obtain an estimate \(\widehat{\beta}_1\) that we can give a causal interpretation? (On average, change in \(D\) causes \(\widehat{\beta}_1\) change in \(Y\))

We must assume

  • \(Y_i\) actually produced according to the response schedule (equation is correctly specified; e.g., linear and additive)
  • \(D_i\) is independent of \(\epsilon_i\): \(D_i \perp \!\!\! \perp \epsilon_i\).

In this scenario, if \(D\) were binary and we had randomization, this is equivalent to estimating the \(ACE\) for an experiment.
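A simulation sketch of that claim (all parameter values invented): when \(Y\) really is generated by the linear response schedule and \(D\) is independent of \(\epsilon_i\), least squares recovers \(\beta_0\) and \(\beta_1\).

# Sketch: Y generated by the response schedule, D independent of epsilon
set.seed(6)
n = 10000
D = runif(n, 0, 10)
eps = rnorm(n)        # independent of D by construction
Y = 1 + 0.5 * D + eps # beta_0 = 1, beta_1 = 0.5

coef(lm(Y ~ D))       # approximately (1, 0.5)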

Response Schedule

If we want to use regression for conditioning, then the model would look different:

\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]

  • here we add conditioning variables \(\mathbf{X_i}\) to the model.
  • regression is estimating the conditional expectation function \(E[Y(d) | D = d, X = x]\): this is what we need for conditioning

\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]

Given what we have learned about regression so far…

  • How is using regression to estimate \(E[Y(d) | D = d, X = x]\) different from when we exactly matched cases?
  • How do the assumptions required for conditioning translate to the regression context?

Conditioning with Regression

How it works:

Imagine we want to know the efficacy of UN Peacekeeping operations (Doyle & Sambanis 2000) after civil wars:

We can compare the post-conflict outcomes of countries with and without UN Peacekeeping operations.

To address concern about confounding, we condition on war type (non/ethnic), war deaths, war duration, number of factions, economic assistance, energy consumption, natural resource dependence, and whether the civil war ended in a treaty.

122 conflicts… can we find exact matches?

Without perfect matches on possible confounders, we don’t have cases without a Peacekeeping operation that we can use to substitute for the counterfactual outcome in conflicts with a Peacekeeping force.

We can use regression to linearly approximate the conditional expectation function \(E[Y(d) | D = d, X = x]\) to plug in the missing values.

\[Y_i = \beta_0 + \beta_D \cdot D_i + \mathbf{X_i\beta_X} + \epsilon_i\]

Copy the data, ds2000, from here: https://pastebin.com/bAcmrPdN

  1. Regress success on untype4, logcost, wardur, factnum, trnsfcap, treaty, wartype, develop, exp, decade, using lm(); save this as m
  2. Use the model to predict counterfactual potential outcomes for each conflict:
  • Create a copy of ds2000, and flip the value of untype4 (0 to 1, 1 to 0).
  • Use predict(m, newdata = ds2000_copy) to add a new column called y_hat to ds2000.

Next:

  1. Create a new column y1, which equals success for cases with untype4 == 1, and y_hat for cases with untype4 == 0
  2. Create a new column y0, which equals success for cases with untype4 == 0, and y_hat for cases with untype4 == 1
  3. Calculate tau_i as the difference between y1 and y0. Then calculate the mean of tau_i.
  4. Compare to the coefficient on untype4 in your regression results.
m = lm(success ~ untype4 + treaty + wartype + decade + 
                factnum + logcost + wardur + trnsfcap + develop + exp,
       data = ds2000)

# copy the data and flip treatment status (0 to 1, 1 to 0)
cf_ds2000 = ds2000 %>% as.data.frame
cf_ds2000$untype4 = 1*!(cf_ds2000$untype4)

# predicted counterfactual outcome for each conflict (ds2000 is a data.table)
ds2000[, y_hat := predict(m, newdata = cf_ds2000)]

# observed outcome where treated, imputed prediction where untreated (and vice versa)
ds2000[, y1 := ifelse(untype4 %in% 1, success, y_hat)]
ds2000[, y0 := ifelse(untype4 %in% 0, success, y_hat)]

# unit-level effect tau_i, then its mean
ds2000[, tau := y1 - y0]
ds2000[, tau] %>% mean
## [1] 0.4394185

Results:

Model 1: estimate (std. error)
(Intercept) 1.544*** (0.201)
untype4 0.439* (0.169)
treaty 0.302** (0.093)
wartype −0.235** (0.075)
decade −0.035 (0.026)
factnum −0.070** (0.026)
logcost −0.065*** (0.016)
wardur 0.001 (0.000)
trnsfcap 0.000 (0.000)
develop 0.000 (0.000)
exp −0.981* (0.440)
Num.Obs. 122
R2 0.428
RMSE 0.36

\(i\) \(UN_i\) \(Y_i(1)\) \(Y_i(0)\) \(\tau_i\)
111 0 \(\color{red}{E[Y(1) | D = 0, X = x]}\) 0.00 \(\color{red}{?}\)
112 0 \(\color{red}{E[Y(1) | D = 0, X = x]}\) 0.00 \(\color{red}{?}\)
113 0 \(\color{red}{E[Y(1) | D = 0, X = x]}\) 0.00 \(\color{red}{?}\)
114 0 \(\color{red}{E[Y(1) | D = 0, X = x]}\) 1.00 \(\color{red}{?}\)
115 0 \(\color{red}{E[Y(1) | D = 0, X = x]}\) 1.00 \(\color{red}{?}\)
116 1 0.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
117 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
118 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
119 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
120 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
121 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)
122 1 1.00 \(\color{red}{E[Y(0) | D = 1, X = x]}\) \(\color{red}{?}\)

Imputing those cells with the regression predictions:

\(i\) \(UN_i\) \(Y_i(1)\) \(Y_i(0)\) \(\tau_i\)
111 0 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) 0.00 \(\color{red}{?}\)
112 0 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) 0.00 \(\color{red}{?}\)
113 0 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) 0.00 \(\color{red}{?}\)
114 0 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) 1.00 \(\color{red}{?}\)
115 0 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) 1.00 \(\color{red}{?}\)
116 1 0.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
117 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
118 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
119 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
120 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
121 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)
122 1 1.00 \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) \(\color{red}{?}\)

Plugging in the predicted values:

\(i\) \(UN_i\) \(Y_i(1)\) \(Y_i(0)\) \(\tau_i\)
111 0 0.70 0.00 \(\color{red}{0.70}\)
112 0 0.68 0.00 \(\color{red}{0.68}\)
113 0 0.68 0.00 \(\color{red}{0.68}\)
114 0 0.87 1.00 \(\color{red}{-0.13}\)
115 0 0.97 1.00 \(\color{red}{-0.03}\)
116 1 0.00 0.10 \(\color{red}{-0.10}\)
117 1 1.00 0.41 \(\color{red}{0.59}\)
118 1 1.00 0.34 \(\color{red}{0.66}\)
119 1 1.00 0.54 \(\color{red}{0.46}\)
120 1 1.00 0.73 \(\color{red}{0.27}\)
121 1 1.00 0.33 \(\color{red}{0.67}\)
122 1 1.00 0.47 \(\color{red}{0.53}\)

Conditioning with Regression

How it works:

  • Rather than “fill” missing potential outcomes of \(Y(d)\) with observed outcomes from cases that match exactly on \(X\), we fit the linear conditional expectation function and plug in the predicted value for the counterfactual potential outcome
  • Rather than hold confounders constant, we ensure zero linear association between confounders and \(D\).

Conditioning with Regression

Assumptions

  1. Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e. within values of \(X\), \(D\) must be as-if random)
  • \(\epsilon_i\) contributes to \(Y_i\), so it helps determine the potential outcomes. Thus, this assumption is equivalent to \(D_i \perp \!\!\! \perp \epsilon_i | X_i\).
  • we specify the relationship between \(X, D\), and \(X,Y\) to be linear and additive. So we must assume that any dependence is linear and additive

Conditioning with Regression

Assumptions

  1. Conditional Independence: Questions that arise when using regression:
  • what if we’ve forgotten a variable?
  • what if things are not linear and additive?

What if we forgot a variable?

If the true process generating the data is:

\[Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \nu_i\]

with \((D_i,X_i) \perp \!\!\! \perp \nu_i\), \(E(\nu_i) = 0\)

What happens when we estimate this model with a constant and \(D_i\) but exclude \(X_i\)?

\[Y_i = \beta_0 + \beta_1 D_i + \epsilon_i\]

\[\small\begin{eqnarray} \widehat{\beta_1} &=& \frac{Cov(D_i, Y_i)}{Var(D_i)} \\ &=& \frac{Cov(D_i, \beta_0 + \beta_1 D_i + \beta_2 X_i + \nu_i)}{Var(D_i)} \\ &=& \frac{Cov(D_i, \beta_1 D_i)}{Var(D_i)} + \frac{Cov(D_i,\beta_2 X_i)}{Var(D_i)} + \frac{Cov(D_i,\nu_i)}{Var(D_i)} \\ &=& \beta_1\frac{Var(D_i)}{Var(D_i)} + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)} \\ &=& \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)} \end{eqnarray}\]

So \(E(\widehat{\beta_1}) \neq \beta_1\): the estimate is biased whenever the second term is nonzero
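The formula is easy to verify by simulation (a sketch; all parameter values invented):

# Sketch: the short-regression slope equals beta_1 + beta_2 * Cov(D, X) / Var(D)
set.seed(7)
n = 10000
X = rnorm(n)
D = 0.8 * X + rnorm(n)           # D and X are correlated (X is a confounder)
Y = 1 + 2 * D + 3 * X + rnorm(n) # beta_1 = 2, beta_2 = 3

coef(lm(Y ~ D))["D"]             # biased: far from 2
2 + 3 * cov(D, X) / var(D)       # approximately matches the OVB formula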

Omitted Variable Bias

When we exclude \(X_i\) from the regression, we get:

\[\widehat{\beta_1} = \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)}\]

This is omitted variable bias

  • recall: \(\beta_1\) is the effect of \(D\) on \(Y\)
  • recall: \(\beta_2\) is the effect of \(X\) on \(Y\)

Excluding \(X\) from the model: \(\widehat{\beta_1} = \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)}\)

What is the direction of the bias when:

  1. \(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} < 0\)

  2. \(\beta_2 < 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} < 0\)

  3. \(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} > 0\)

  4. \(\beta_2 = 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} > 0\)

  5. \(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} = 0\)

Omitted Variable Bias

This only yields bias if two conditions are true:

  1. \(\beta_2 \neq 0\): omitted variable \(X\) has an effect on \(Y\)

  2. \(\frac{Cov(D_i, X_i)}{Var(D_i)} \neq 0\): omitted variable \(X\) is correlated with \(D\). (on the same backdoor path)

This is why we don’t need to include EVERYTHING that might affect \(Y\) in our regression equation; only those variables that affect treatment and the outcome.

Omitted Variable Bias

Link to DAGs:

  • OVB solved when we “block” backdoor paths from \(D\) to \(Y\).

Link to Conditional Independence:

  • OVB is a result of conditional independence assumption being wrong

Link to linearity:

  • OVB formula only describes bias induced by exclusion of a confounder that has a linear relationship with \(D\) and \(Y\).

Example:

Let’s turn to a different example: do hours increase earnings?

Let’s say that we estimate this model, and we have included all possible confounders.

\[\begin{eqnarray}Y_i = \beta_0 + \beta_1 Hours_i + \beta_2 Female_i + \\ \beta_3 Age_i + \beta_4 Law_i + \epsilon_i\end{eqnarray}\]

And we want to estimate \(\beta_1\) with \(\widehat{\beta_1}\)

  • how are we assuming linearity?
  • how are we assuming additivity?

Example:

If we are imputing missing potential outcomes of earnings for different hours worked using this model…

We need to ask…

  • Are the relationships between age (\(X\)) and hours (\(D\)), and between age (\(X\)) and income earned (\(Y\)), linear?
  • Are earnings (\(Y\)) related to gender and age (\(X\)) in an additive way? (Is the relationship between age and earnings the same for men and women?)
  • Are the effects of hours worked linear? The same, regardless of gender and age?

Example:

[Figures: relationships between D and X, Y and X, and D and Y]

Assuming additivity and linearity:

m_linear_additive = lm(INCEARN ~ UHRSWORK + Gender + AGE + LAW, data = acs_data) 

Linear/Additive
Hours Worked 1076*** (115)
Male 37223*** (2800)
Age (Years) 3453*** (125)
Law −48920*** (2675)
Num.Obs. 10000
R2 0.146
RMSE 127723.58

  • How do we interpret the intercept?

Non-linear dependence between D and X, despite regression

Assuming linearity, not additivity: linear relationship between Age and hours/earnings varies by gender and profession

m_linear_interactive = lm(INCEARN ~ UHRSWORK + Gender*AGE*LAW, data = acs_data)

Linear/Additive  Linear/Interactive
Hours Worked 1076*** (115) 1175*** (114)
Male 37223*** (2800) 51887** (18118)
Age (Years) 3453*** (125) 4985*** (341)
Law −48920*** (2675) 105972*** (19357)
Num.Obs. 10000 10000
R2 0.146 0.158
RMSE 127723.58 126847.51

Non-linear dependence between D and X, despite regression

Assuming neither linearity nor additivity: fit an intercept for every combination of gender, profession, and age in years. (technically still linear in parameters, but imposing no functional-form assumption)

m_full = lm(INCEARN ~ UHRSWORK + as.factor(Gender)*as.factor(AGE) + LAW, data = acs_data)

Linear/Additive  Linear/Interactive  Nonlinear/Interactive
Hours Worked 1076*** (115) 1175*** (114) 1305*** (113)
Male 37223*** (2800) 51887** (18118)
Age (Years) 3453*** (125) 4985*** (341)
Law −48920*** (2675) 105972*** (19357) −46702*** (2614)
Num.Obs. 10000 10000 10000
R2 0.146 0.158 0.199
RMSE 127723.58 126847.51 123733.07

No dependence between \(D\) and \(X\), after regression

Conditional Independence in Regression

Even if we included all variables on backdoor path between \(D\) and \(Y\), regression may still produce biased estimate:

  • we assume the conditional expectation function is linear and additive.
  • but the world might be non-linear and interactive
  • even absent OVB, our decisions about how to specify the regression equation can lead to bias: \(\Rightarrow\) model dependence
  • Our imputation of missing potential outcomes is biased due to choices we make about how to impute (linear/additive)